Conversation

@Yicong-Huang Yicong-Huang commented Jan 7, 2026

What changes were proposed in this pull request?

Add tests for PyArrow's pa.array type inference behavior. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover type inference across input categories:

  1. Nullable data: inputs containing None values
  2. Plain Python instances: list, tuple, dict (struct)
  3. Pandas instances: numpy-backed Series, nullable extension types, ArrowDtype
  4. NumPy arrays: all numeric dtypes, datetime64, timedelta64
  5. Nested types: list of list, list of struct, struct of struct, struct of list
  6. Explicit type specification: large_list, fixed_size_list, map_, large_string, large_binary

Types tested include:

| Category | Types covered |
| --- | --- |
| Primitive | int8/16/32/64, uint8/16/32/64, float16/32/64, bool, string, binary |
| Temporal | date32, timestamp (s/ms/us/ns), time64, duration (s/ms/us/ns) |
| Decimal | decimal128 |
| Nested | list_, struct, map_ (explicit only) |
| Large variants | large_list, large_string, large_binary (explicit only) |

Pandas extension types tested:

  • Nullable types: pd.Int8Dtype() ... pd.Int64Dtype(), pd.UInt8Dtype() ... pd.UInt64Dtype(), pd.Float32Dtype(), pd.Float64Dtype(), pd.BooleanDtype(), pd.StringDtype()
  • PyArrow-backed: pd.ArrowDtype(pa.int64()), pd.ArrowDtype(pa.float64()), pd.ArrowDtype(pa.large_string()), etc.

Why are the changes needed?

This is part of SPARK-54936 to monitor behavior changes from upstream dependencies. By testing PyArrow's type inference behavior, we can detect breaking changes when upgrading PyArrow versions.

Does this PR introduce any user-facing change?

No. This PR only adds tests.

How was this patch tested?

New unit tests added:

```
python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_inference.py -v
```

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot commented Jan 7, 2026

JIRA Issue Information

=== Sub-task SPARK-54938 ===
Summary: Add tests for pa.array type inference
Assignee: Yicong Huang
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@HyukjinKwon (Member) commented:

cc @zhengruifeng

@zhengruifeng (Contributor) left a comment:

Thanks so much, it is much cleaner.

Inspired by https://github.com/apache/spark/pull/53727/changes, I think we need to also test the following cases:

1. strings: non-English values
2. integrals: min/max values, and make them overflow
3. floats: nan, -inf, inf
4. time: Unix epoch, min/max values

@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from efbc505 to 905b616 on January 9, 2026 at 21:34
@Yicong-Huang force-pushed the same branch from 905b616 to f12b2a0 on January 9, 2026 at 21:37
# unittests for upstream projects
"pyspark.tests.upstream.pyarrow.test_pyarrow_ignore_timezone",
"pyspark.tests.upstream.pyarrow.test_pyarrow_scalar_type_inference",
"pyspark.tests.upstream.pyarrow.test_pyarrow_type_inference",
Contributor:
let's rename the file test_pyarrow_array_type_inference

Contributor Author:
renamed.

@zhengruifeng (Contributor) left a comment:
otherwise, LGTM

@zhengruifeng (Contributor) commented:

thanks, merged to master
